The bacterial density of clinical rectal swabs is highly variable, correlates with sequencing contamination, and predicts patient risk of extraintestinal infection

Background In ecology, population density is a key feature of community analysis. Yet in studies of the gut microbiome, bacterial density is rarely reported. Studies of hospitalized patients commonly use rectal swabs for microbiome analysis, yet variation in their bacterial density—and the clinical and methodologic significance of this variation—remains undetermined. We used an ultra-sensitive quantification approach—droplet digital PCR (ddPCR)—to quantify bacterial density in rectal swabs from 118 hospitalized patients. We compared bacterial density with bacterial community composition (via 16S rRNA amplicon sequencing) and clinical data to determine if variation in bacterial density has methodological, clinical, and prognostic significance. Results Bacterial density in rectal swab specimens was highly variable, spanning five orders of magnitude (1.2 × 104–3.2 × 109 16S rRNA gene copies/sample). Low bacterial density was strongly correlated with the detection of sequencing contamination (Spearman ρ = − 0.95, p < 10−16). Low-density rectal swab communities were dominated by peri-rectal skin bacteria and sequencing contaminants (p < 0.01), suggesting that some variation in bacterial density is explained by sampling variation. Yet bacterial density was also associated with important clinical exposures, conditions, and outcomes. Bacterial density was lower among patients who had received piperacillin-tazobactam (p = 0.017) and increased among patients with multiple medical comorbidities (Charlson score, p = 0.0040) and advanced age (p = 0.043). Bacterial density at the time of hospital admission was independently associated with subsequent extraintestinal infection (p = 0.0028), even when controlled for severity of illness and comorbidities. Conclusions The bacterial density of rectal swabs is highly variable, and this variability is of methodological, clinical, and prognostic significance. Microbiome studies using rectal swabs are vulnerable to sequencing contamination and should include appropriate negative sequencing controls. Among hospitalized patients, gut bacterial density is associated with clinical exposures (antibiotics, comorbidities) and independently predicts infection risk. Bacterial density is an important and under-studied feature of gut microbiome community analysis. Video abstract Supplementary Information The online version contains supplementary material available at 10.1186/s40168-021-01190-y.

We designed a retrospective cohort study using a secondary analysis of clinically collected rectal swabs from hospitalized patients. We used hospital admission swabs previously collected, processed, and analyzed for a study of gut microbiome risk factors for Vancomycin resistant Enterococcus (VRE) acquisition in 118 patients admitted to the University of Michigan Hospital in 2016-2017 1 . In the prior study, used 236 rectal swab samples from 59 matched pairs to the study of gut microbiota of case and control subjects admitted to the University of Michigan Hospital during the study period. The infection control practice throughout the study period was to perform routine surveillance for VRE using rectal swabs on eight adult hospital units, including intensive care units, the hematology and oncology ward, and the bone marrow transplant ward. All hospitalized patients had routine collection of rectal swabs on admission and weekly thereafter to screen for VRE. In the prior study, cases were defined as subjects with an initial negative swab followed by a positive swab when evaluated by selective culture. We identified the "time at risk" for each case patient, defined as the time elapsed between admission and positive VRE screen. We matched each case subject to a control subject with an initial negative swab followed by repeat negative swab within the same time at risk (±5%). An additional matching factor was the unit from which the first positive VRE was recovered for cases or the matched swab after the time at risk for controls. For the current study, we restricted our analysis to admission rectal swabs (one swab per patient). We preformed an analysis on the entire cohort without reference to VRE colonization status.

Bacterial DNA isolation:
DNA isolation was performed with a single kit according to a modified protocol previously demonstrated to isolate bacterial DNA 5 . Briefly, rectal swab specimens were re-suspended in 360 µl ATL buffer (cell lysis solution, Qiagen DNeasy Blood & Tissue kit, catalog no. 69506) and homogenized in PowerBead Tubes (Qiagen, Hilden, Germany, catalog no. 13123-50).. ZymoBIOMICS Microbial Community DNA Standard (Zymo Research cat# D6306) was sequenced as a positive control. Sterile laboratory water, AE buffer (solution of 10 mM Tris-Cl 0·5 mM in EDTA; pH 9·0), and extraction control specimens were collected and analyzed as potential sources of contamination (negative controls).

Bacterial density quantification
Bacterial DNA quantification Bacterial DNA was quantified using a QX200 Droplet Digital PCR System (BioRad, Hercules, CA). The technique partitions a single sample into 20,000 droplets. A standard PCR reaction then amplifies 16S specific cDNA in each droplet, and each droplet is individually counted by the associated target dependent fluorescence signal as positive or negative. This allows for absolute 16S copy number quantification sample without generating a standard curve18-20. Primers and cycling conditions were performed according to a previously published protocol20. Specifically, primers were 5'-GCAGGCCTAACACATGCAAGTC-3' (63F) and 5'-CTGCTGCCTCCCGTAGGAGT-3' (355R). The cycling protocol was 1 cycle at 95°C for 5 minutes, 40 cycles at 95°C for 15 seconds and 60°C for 1 minute, 1 cycle at 4°C for 5 minutes, and 1 cycle at 90°C for 5 minutes all at a ramp rate of 2°C/second. The BioRad C1000 Touch Thermal Cycler was used for PCR cycling. Droplets were detected using the automated droplet reader (Bio-Rad, catalog no. 1864003), quantified using Quantasoft TM Analysis Pro (version 1.0.596), and imported to R for visualization and statistical analysis.

16s rRNA gene sequencing
The V4 region of the 16s rRNA gene was amplified using published primers and the dual-indexing sequencing strategy described previously 2 . Sequencing was performed using the Illumina MiSeq platform (San Diego, CA), using a MiSeq Reagent Kit V2 (500 cycles), according to the manufacturer's instructions with modifications found in the standard operating procedure of the laboratory of Dr. Patrick Schloss 3,4 Sequencing reagents were prepared according to the Schloss SOP and custom read 1, read 2, and index primers were added to the reagent cartridge. Amplicons were sequenced using the Illumina MiSeq platform (San Diego, CA) using a MiSeq Reagent Kit V2 (Illumina, catalog no. MS102-2003) for 500 cycles. A synthetic community (n=4; ZymoBIOMICS Microbial Community DNA Standard, Zymo Research catalog no. D6306) was sequenced as a positive control. Sterile laboratory water (n=8), AE buffer (solution of 10 mM Tris-Cl 0·5 mM in EDTA; pH 9·0, [n=6]) used in DNA isolation, and extraction control specimens (n=6),were collected and analyzed as potential sources of contamination (negative controls). FASTQ files were generated with paired end reads and retained for further analysis.

Adequacy of sequencing.
We performed 16S rRNA gene amplicon sequencing on 236 rectal swab specimens and 15 negative-control specimens, which identified 1,188 unique operational taxonomic units (genus-level bacterial taxa) at a dissimilarity threshold of 3%. After bioinformatics processing, the mean number of reads per sample was 71,484 ± 2,684. No specimens were excluded from the analysis.

16S Gene analysis:
16S rRNA gene sequencing data were processed using mothur (v. 1.43.0) according to the Standard Operating Procedure for MiSeq sequence data using a minimum sequence length of 250 base pairs 4 . To summarize, the SILVA rRNA database 5 (v. 132, silva.nr_v132.regionV4.align) was used as a reference for sequence alignment and taxonomic classification. K-mer searching with 8mers was used to assign raw sequences to their closest matching template in the reference database, and pairwise alignment was performed with the Needleman-Wunsch 6 and NAST algorithms 7 . A k-mer-based naive Bayesian classifier 8 was used to assign sequences to their correct taxonomy with a bootstrap confidence score threshold of 80. Pairwise distances between aligned sequences were calculated by the method employed by Sogin et al. 2 where pairwise distance equals mismatches, including indels, divided by sequence length. A distance matrix was passed to the OptiCLUST clustering algorithm 9 to cluster sequences into "operational taxonomic units" (OTUs) by maximizing the Matthews correlation coefficient with adissimilarity threshold of 3% 10 . OTU numbers were arbitrarily assigned in the binning process and are referred to throughout the manuscript in association with their most specified level of taxonomy (typically genus or family). OTUs were classified using the mothur implementation of the Ribosomal Database Project (RDP) classifier and RDP taxonomy training set 16 (trainset16_022016.rdp.fasta, trainset16_022016.rdp.tax), available on the mothur website 4 . After clustering and classification of raw sequencing data, we evaluated differences in community structure with permutational multivariate analysis of variance (PERMANOVA) in the vegan package (v2.0-4) 11 in R (v 3.6.4) 12 . We performed resampling of multiple generalized linear models with the mvabund 13 package in R to look for individual OTU differences between communities. We set a significance threshold of 0.01 after adjusting for multiple comparisons using a stepdown resampling procedure to reduce the type I error rate 14 . We confirmed individual OTU differences with random forest classification and regression models built with the ranger package in R (v 0.11.2) 15 . We used the caret (v 6.0-84) 16 package in R for cross-validation and to optimize the hyperparameters of the number of decision trees in the model and the number of features considered by each tree when splitting a node. We corrected for feature importance bias in random forest models with a permutation importance (PIMP) heuristic developed by Altmann et al. 17 .

Clinical metadata:
We collected data from the electronic medical record to describe host health both by the severity of the acute illness that prompted hospitalization and by the severity of chronic disease before hospitalization. We measured acute illness and chronic disease with the validated Sequential Organ Failure Assessment Score (SOFA score) 18-20 and Charlson comorbidity index 21-23 , respectively. We collected data on the antibiotic exposure of patients in the Emergency Department prior to collection of their initial rectal swab. 116 of the 118 rectal swabs in this cohort belonged to patients with accessible clinical metadata through the electronic medical record and were included in our analysis. 2 rectal swabs belonged to patients with sensitive information inaccessible through the electronic medical record and outside of the scope of our Institutional Research Board approval. Thus, only 116 of 118 subjects were included in the clinical metadata analysis.
We used infection-free survival to study the prognostic significance of bacterial density on rectal swabs. We defined extra-intestinal infection as the growth of a bacterial organism by traditional culture media in a site considered by clinicians to be "sterile" (blood, urine, ascites fluid, cerebrospinal fluid, sputum, deep tissue culture) meeting clinical criteria set by major medical societies and the Centers for Disease Control and Prevention for bacterial peritonitis 24 , urinary tract infection 25,26 , pneumonia 27-29 , skin and soft tissue infection 30 , and bacteremia 29 . Clinical adjudication of positive culture growth led to categorization as colonization without infection, contamination, or clinical infection.
We reviewed the admission history and physical documentation as well as the hospital discharge summary to determine the admitting diagnosis for patients in the cohort. We broadly classified admitting diagnoses into 7 categories: cardiopulmonary disorder (which included congestive heart failure, myocardial infarction, and respiratory failure not attributable to pneumonia, and post-operative ICU stay after major cardiac surgery); primary neurologic disorder (which included intracranial hemorrhage, ischemic stroke, or post-operative recovery after major neurosurgery), Sepsis syndrome (defined as a presumed infection on admission requiring the use of antibiotics), gastrointestinal disruption (which included inflammatory bowel disease, pancreatitis, bowel obstruction or perforation, or post operative status after major gastrointestinal surgery), trauma, non-infectious complications of chemotherapy (which included acute renal injury, cytopenia without the presence of neutropenic fever, and nausea and vomiting attributable to chemotherapy), and non-infectious complications of bone-marrow transplantation (which included graft versus host disease as well as nausea and vomiting in the absence of recent chemotherapy administration)

Statistical analysis of clinical metadata:
All analyses were performed using the R programming statistical programming language (v 4·0·2) 12 . A multivariate linear regression model using clinical covariates to predict log transformed bacterial density was built with the stats package in R 12 . We constructed Kaplan-Meier curves to determine the median infection free survival in subjects above and below a critical threshold of 10 6 16S copies/sample. We used a stratified log-rank statistic to determine the statistical significance of differences in infection free survival between groups. After checking the proportional hazards assumption, we built Cox proportional hazards models incorporating bacterial density and clinical covariates were built to predict infection free survival. All survival analysis was done with the survival 31 (v 3·1-8) package in R .Pairwise significance was determined as appropriate by the Wilcoxon test with the Benjamini-Hochberg correction for multiple comparisons, Tukey's HSD test, and two-sample independent Mann-Whitney U test. All statistical tests used p=0.05 as a threshold for significance.
Appendix Table 1: Univariate comparisons of difference in bacterial density by demographics and comorbidities Appendix Table 2: Total antibiotic exposure in the cohort Appendix Table 3: Summary statistics of bacterial density by hospital unit Appendix Table 4: Tukey HSD comparisons of bacterial density by unit of admission Appendix Table 5: Alternative linear mixed effects model of features associated with bacterial density (log 16S copies/specimen) including unit of admission and mechanically ventilated status Appendix Table 6: Composite outcomes in the cohort Appendix Table 7: Pathogens isolated in cohort Appendix Table 8: Alternative multivariable frailty model of features associated with bacterial infection with unit of admission and mechanically ventilated status included Appendix Table 9: Features driving separation in community structure identified by random forest achieving significance after correcting for feature importance bias Supplemental Figure 1 Bacterial density by hospital unit Appendix